HW 3 Nonlinear & Nonparametric Regression: Data Analysis Problems

Advanced Regression (STAT 353-0)

Author

Zihan Chu

Published

February 16, 2025

Github Repo Link


https://github.com/karliechu1023/myrepo.git

Important

All students are required to complete this problem set!

Load packages & data

library(ggplot2)
library(car)
library(MASS)
library(plotly)
library(olsrr)

Data analysis problems

1. Exercise D17.1

The data in Ginzberg.txt (collected by Ginzberg) were analyzed by Monette (1990). The data are for a group of 82 psychiatric patients hospitalized for depression. The response variable in the data set is the patient’s score on the Beck scale, a widely used measure of depression. The explanatory variables are “simplicity” (measuring the degree to which the patient “sees the world in black and white”) and “fatalism”. (These three variables have been adjusted for other explanatory variables that can influence depression.) Use the adjusted scores for the analysis.

Using the full quadratic regression model

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1^2 + \beta_4 X_2^2 + \beta_5 X_1 X_2 + \epsilon\]

regress the Beck-scale scores on simplicity and fatalism.

  1. Are the quadratic and product terms needed here?
Solution
gin_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/Ginzberg.txt")
# with quadratic and product terms
model <- lm(adjdepression ~ adjsimplicity + adjfatalism + 
              I(adjsimplicity^2) + I(adjfatalism^2) + 
              adjsimplicity * adjfatalism, data = gin_data)

summary(model)

Call:
lm(formula = adjdepression ~ adjsimplicity + adjfatalism + I(adjsimplicity^2) + 
    I(adjfatalism^2) + adjsimplicity * adjfatalism, data = gin_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.63090 -0.23619 -0.06667  0.21295  1.04481 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)               -0.10865    0.20687  -0.525   0.6010   
adjsimplicity              0.82039    0.29702   2.762   0.0072 **
adjfatalism                0.55247    0.28391   1.946   0.0554 . 
I(adjsimplicity^2)         0.08172    0.15374   0.532   0.5966   
I(adjfatalism^2)           0.20927    0.18663   1.121   0.2657   
adjsimplicity:adjfatalism -0.55366    0.27855  -1.988   0.0505 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3738 on 76 degrees of freedom
Multiple R-squared:  0.4755,    Adjusted R-squared:  0.441 
F-statistic: 13.78 on 5 and 76 DF,  p-value: 1.416e-09
# without quadratic and product terms
model_linear <- lm(adjdepression ~ adjsimplicity + adjfatalism, data = gin_data)
anova(model_linear, model)
Analysis of Variance Table

Model 1: adjdepression ~ adjsimplicity + adjfatalism
Model 2: adjdepression ~ adjsimplicity + adjfatalism + I(adjsimplicity^2) + 
    I(adjfatalism^2) + adjsimplicity * adjfatalism
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     79 11.478                           
2     76 10.621  3   0.85711 2.0444 0.1147

From the summary, we can see that the quadratic terms have insignificant p-values (0.5966 and 0.2657), indicating they do not contribute meaningfully. The interaction term is on the edge of significance (p = 0.0505), suggesting a possible interaction effect, but not strongly enough to be conclusive. The incremental F-test comparing the two models gives p = 0.1147, which is not statistically significant. This suggests that adding the quadratic and interaction terms does not significantly improve the model fit.

  2. Graph the data and the fitted regression surface in three dimensions. Do you see any problems with the data?
Solution
grid <- expand.grid(
  adjsimplicity = seq(min(gin_data$adjsimplicity), max(gin_data$adjsimplicity), length.out = 30),
  adjfatalism = seq(min(gin_data$adjfatalism), max(gin_data$adjfatalism), length.out = 30)
)
grid$predicted <- predict(model, newdata = grid)
z_matrix <- matrix(grid$predicted, nrow = 30, byrow = TRUE)

fig <- plot_ly()

fig <- fig %>%
  add_trace(
    x = gin_data$adjsimplicity,
    y = gin_data$adjfatalism,
    z = gin_data$adjdepression,
    type = "scatter3d",
    mode = "markers",
    marker = list(size = 5, color = "blue"),
    name = "Actual Data"
  )

fig <- fig %>%
  add_trace(
    # surface traces expect x and y as the 30 grid axis values, not the full 900-row grid
    x = sort(unique(grid$adjsimplicity)),
    y = sort(unique(grid$adjfatalism)),
    z = z_matrix,
    type = "surface",
    colorscale = "Viridis",
    opacity = 0.6,
    name = "Regression Surface"
  )

fig

For data distribution: the blue dots cluster in a relatively compact region, mostly in the middle range of both simplicity and fatalism scores. However, one notable outlier sits far from this main cluster. The data points do not follow any strong curved pattern; the relationship between the variables appears predominantly linear.

The surface's curves and twists may be artifacts of overfitting, and the surface extends beyond the region where we have actual data points. The outlier appears to exert substantial influence on the fitted surface. The spread of the data points around the fitted surface suggests considerable variability that the model does not capture.

  3. What do standard regression diagnostics for influential observations show?
Solution
influence_measures <- influence.measures(model)
influence_plot <- ols_plot_resid_lev(model)

# Cook's distance plot
plot(cooks.distance(model), 
     ylab = "Cook's Distance",
     main = "Influence Plot - Cook's Distance")
abline(h = 4/nrow(gin_data), col = "red", lty = 2)

par(mfrow = c(2,2))
plot(model)

  1. Cook’s Distance: Most observations have very low Cook’s distance values (close to 0). There are three notably influential points with higher Cook’s distances, particularly observations around indices 70-80. The red dashed line represents a common threshold for concerning influence (4/n). While these points have higher influence than others, they don’t appear to be extremely problematic as their Cook’s distances are still relatively modest.
  2. Residual: The residuals scatter fairly randomly around the horizontal line at zero and there’s no obvious pattern. The spread of residuals appears relatively consistent across fitted values.
  3. Normal Q-Q plot: Most points follow the diagonal line quite well, though there is some minor deviation at the extreme ends. Overall, the normality assumption appears reasonably satisfied.
  4. Scale-Location: The relatively horizontal red line suggests fairly consistent variance.
  5. Residual vs Leverage: Most observations have low leverage and moderate residuals, and no points fall outside the dashed Cook’s distance lines, suggesting no severely problematic observations.
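The visual impressions above can be checked numerically. A minimal sketch, assuming the `model` object fitted in part (a), that lists the observations exceeding the common 4/n Cook's distance cutoff:

```r
# flag observations whose Cook's distance exceeds the 4/n rule of thumb
cd <- cooks.distance(model)
flagged <- which(cd > 4 / length(cd))
round(cd[flagged], 3)  # named vector: row indices with their distances
```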

2. Exercise D18.2

For this analysis, use the States.txt data, which includes average SAT scores for each state as the outcome.

  1. Put together a model with SAT math (satMath) as the outcome and region, population, percentTaking, and teacherPay as the explanatory variables, each included as linear terms. Interpret the findings.
Solution
s_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/States.txt")
s_data$region = as.factor(s_data$region)
model1 <- lm(satMath ~ region + population + percentTaking + teacherPay, data = s_data)
summary(model1)

Call:
lm(formula = satMath ~ region + population + percentTaking + 
    teacherPay, data = s_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-25.0818  -7.4076   0.7687   8.6216  22.4538 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.617e+02  2.103e+01  26.706  < 2e-16 ***
regionESC     -9.925e+00  1.017e+01  -0.975   0.3353    
regionMA      -2.350e+00  1.316e+01  -0.179   0.8592    
regionMTN     -1.328e+01  8.925e+00  -1.488   0.1447    
regionNE       4.453e+00  1.286e+01   0.346   0.7311    
regionPAC     -8.657e+00  9.712e+00  -0.891   0.3782    
regionSA      -2.544e+01  1.023e+01  -2.488   0.0172 *  
regionWNC      1.548e+01  9.217e+00   1.679   0.1011    
regionWSC     -1.337e+01  1.054e+01  -1.269   0.2120    
population    -8.416e-05  3.993e-04  -0.211   0.8342    
percentTaking -1.088e+00  1.841e-01  -5.909 6.93e-07 ***
teacherPay     3.711e-01  5.184e-01   0.716   0.4783    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13.46 on 39 degrees of freedom
Multiple R-squared:  0.8836,    Adjusted R-squared:  0.8508 
F-statistic: 26.91 on 11 and 39 DF,  p-value: 7.914e-15
par(mfrow = c(2,2))
plot(model1)  # diagnostics for the model fitted in this part

The model explains a substantial portion of the variance in SAT math scores, with an adjusted R-squared of 0.8508. The F-statistic (26.91, p = 7.914e-15) indicates that the model as a whole is statistically significant. percentTaking has a strong negative relationship with SAT math score (p = 6.93e-07): for each 1% increase in students taking the SAT, the math score decreases by about 1.09 points, holding other variables constant. Among the remaining variables only regionSA is significant; teacher pay and population have minimal effects on SAT math scores.

The residuals appear fairly randomly scattered around zero. In the Q-Q plot, the residuals follow the diagonal line quite well; although there are some outliers, the plot still suggests a reasonably normal distribution of residuals. In the leverage plot, most points have relatively low leverage and no points fall outside the Cook's distance contours.

  2. Now, instead approach building this model using the nonparametric-regression methods discussed in Chapter 18 of our main course textbook, Fox. Fit a general nonparametric regression model and an additive-regression model, comparing the results to each other and to the linear least-squares fit to the data (in part (a)). If you have problems with categorical variables for the nonparametric models, feel free to remove them. Be sure to explain the models.
Solution
library(mgcv)  
library(np)
# 1. LOESS model
loess_model <- loess(satMath ~ population + percentTaking + teacherPay, data = s_data, span = 0.75)
s_data$loess_pred <- predict(loess_model)

ggplot(s_data, aes(x = percentTaking, y = satMath)) +
  geom_point() +
  geom_line(aes(y = loess_pred), color = "blue") +
  labs(title = "LOESS regression", x = "percent taking SAT", y = "SAT math score") +
  theme_minimal()

# 2. GAM model
gam_model <- gam(satMath ~ s(population) + s(percentTaking) + s(teacherPay), data = s_data)
summary(gam_model)

Family: gaussian 
Link function: identity 

Formula:
satMath ~ s(population) + s(percentTaking) + s(teacherPay)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  529.275      1.825     290   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df      F p-value    
s(population)    1.000  1.000  0.468  0.4975    
s(percentTaking) 2.928  3.631 65.111  <2e-16 ***
s(teacherPay)    4.214  5.169  3.066  0.0182 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.86   Deviance explained = 88.3%
GCV = 206.96  Scale est. = 169.86    n = 51
plot(gam_model,pages=1)

# compare R-squared values
lm_r2 <- summary(model1)$r.squared
gam_r2 <- summary(gam_model)$r.sq

print(paste("Linear Model R²:", lm_r2))
[1] "Linear Model R²: 0.883592199243035"
print(paste("GAM Model R²:", gam_r2))
[1] "GAM Model R²: 0.86001598359618"
linear_pred <- predict(model1)
gam_pred <- predict(gam_model)
loess_pred <- predict(loess_model)

par(mfrow = c(2,2))
# plot 1: linear vs LOESS predictions
plot(linear_pred, loess_pred,
     xlab = "Linear model predictions",
     ylab = "LOESS predictions",
     main = "Linear vs LOESS")
abline(0, 1, col = "red")

# plot 2: linear vs GAM
plot(linear_pred, gam_pred,
     xlab = "Linear Model Predictions",
     ylab = "GAM Predictions",
     main = "Linear vs GAM")
abline(0, 1, col = "red")

# plot 3: LOESS vs GAM predictions
plot(loess_pred, gam_pred,
     xlab = "LOESS Predictions",
     ylab = "GAM Predictions",
     main = "LOESS vs GAM")
abline(0, 1, col = "red")

From the first LOESS plot, we see a strong negative relationship that is consistently decreasing but not exactly linear. The blue curve is the LOESS fit; it looks wiggly partly because the plotted predictions also depend on population and teacherPay, not just percentTaking.

From the GAM curves, the almost flat line for population suggests it has minimal impact on SAT math scores, consistent with its insignificant p-value (0.4975). Teacher pay shows some influence, but it is weaker (p = 0.0182). Percent taking the SAT shows a strong nonlinear relationship, and its effect is quite substantial (p < 2e-16).

For the comparison, all three models give very similar predictions. The R-squared values are close, though note that the linear model's value is unadjusted while the GAM's is adjusted, so they are not strictly comparable; on this basis the linear model's value is slightly higher.

Therefore, the linear model performs surprisingly well, despite missing some nonlinear patterns.
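Because the printed comparison mixes an unadjusted R² (linear model) with an adjusted one (GAM), an information-criterion check may be more even-handed. A sketch assuming the `model1` and `gam_model` objects fitted above (loess fits have no AIC method, so only the two likelihood-based models are compared):

```r
# lower AIC indicates a better fit after penalizing model complexity;
# AIC() works for both lm and mgcv::gam objects
AIC(model1, gam_model)
```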

  3. Can you handle the nonlinearity by a transformation or by another parametric regression model, such as a polynomial regression? Investigate and explain. What are the tradeoffs between these nonparametric and parametric approaches?
Solution
# Fit polynomial regression model
poly_model <- lm(satMath ~ population + poly(percentTaking, 2) + teacherPay, 
                 data = s_data)

# log transformation model
log_model <- lm(satMath ~ population + log(percentTaking + 1) + teacherPay,
                data = s_data)
# Compare models
summary(poly_model)

Call:
lm(formula = satMath ~ population + poly(percentTaking, 2) + 
    teacherPay, data = s_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-37.573  -7.595  -0.608   7.067  29.224 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)              4.950e+02  1.489e+01  33.251  < 2e-16 ***
population               9.479e-05  3.783e-04   0.251   0.8033    
poly(percentTaking, 2)1 -2.379e+02  1.790e+01 -13.288  < 2e-16 ***
poly(percentTaking, 2)2  7.206e+01  1.467e+01   4.913 1.18e-05 ***
teacherPay               9.411e-01  4.225e-01   2.228   0.0308 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.25 on 46 degrees of freedom
Multiple R-squared:  0.8461,    Adjusted R-squared:  0.8327 
F-statistic: 63.22 on 4 and 46 DF,  p-value: < 2.2e-16
summary(log_model)

Call:
lm(formula = satMath ~ population + log(percentTaking + 1) + 
    teacherPay, data = s_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.246  -7.320  -0.058  10.093  26.201 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)             6.103e+02  1.204e+01  50.704   <2e-16 ***
population             -1.797e-05  3.708e-04  -0.048   0.9616    
log(percentTaking + 1) -3.870e+01  2.743e+00 -14.106   <2e-16 ***
teacherPay              1.238e+00  4.192e-01   2.953   0.0049 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.33 on 47 degrees of freedom
Multiple R-squared:  0.8409,    Adjusted R-squared:  0.8307 
F-statistic: 82.79 on 3 and 47 DF,  p-value: < 2.2e-16
anova(poly_model, log_model)
Analysis of Variance Table

Model 1: satMath ~ population + poly(percentTaking, 2) + teacherPay
Model 2: satMath ~ population + log(percentTaking + 1) + teacherPay
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     46 9337.6                           
2     47 9654.6 -1   -317.02 1.5617 0.2177
plot(s_data$percentTaking, s_data$satMath,
     xlab = "Percent Taking SAT",
     ylab = "SAT Math Score",
     main = "Comparing Model Fits")

lines(sort(s_data$percentTaking), 
      fitted(poly_model)[order(s_data$percentTaking)],
      col = "blue", lwd = 2)
lines(sort(s_data$percentTaking),
      fitted(log_model)[order(s_data$percentTaking)],
      col = "red", lwd = 2)

For each unit increase in log(percentTaking + 1), SAT math scores decrease by 38.70 points. For each unit increase in teacherPay, scores increase by 1.238 points. Population has negligible effect (-1.797e-05).

Both polynomial terms for percentTaking are highly significant (p < 2e-16 and p = 1.18e-05). teacherPay is significant (p = 0.0308) with a positive effect (0.9411), and population remains non-significant (p = 0.8033). Because poly() fits orthogonal polynomials, the coefficients (-237.9 and 72.06) are not directly interpretable as raw slopes; the significant positive quadratic term simply indicates curvature, with the decline in math scores flattening out as percentTaking grows.

The summaries show that the log-transformation and the polynomial model have similar adjusted R-squared values. The anova() comparison gives p = 0.2177, though since these two models are not nested the F-test should be read with caution; either way, neither model fits the data clearly better than the other.

The drawback of the polynomial regression model is that we must choose a reasonable polynomial degree, or it may overfit the data; it is also less flexible than nonparametric approaches. Its advantages are that it captures the basic nonlinear pattern, can reveal unexpected structure in the data, and has more interpretable coefficients than nonparametric methods.

The log-transformation model is very simple to interpret and handles the diminishing effect. However, it may not capture more complex patterns, and it imposes a specific form of nonlinearity. Since both models fit the data almost equally well, we can choose the log-transformation model for easier interpretation.
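Since `poly()` uses orthogonal polynomials, refitting the quadratic with raw terms gives coefficients that can be read directly on the percentTaking scale. A sketch assuming `s_data` and `poly_model` from above; the fitted values are identical to the `poly()` version:

```r
# raw (non-orthogonal) quadratic terms: same fit, interpretable coefficients
poly_raw <- lm(satMath ~ population + percentTaking + I(percentTaking^2) + teacherPay,
               data = s_data)
all.equal(fitted(poly_raw), fitted(poly_model))  # should be TRUE
coef(poly_raw)
```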

3. Exercise D18.3

Return to the Chile.txt dataset used in HW 2. Reanalyze the data employing generalized nonparametric regression (including generalized additive) models. As in HW2, you can remove abstained and undecided votes, and focus only on Yes and No votes.

  1. What, if anything, do you learn about the data from the nonparametric regression?
Solution
chile_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/Chile.txt")
library(dplyr)  # needed for %>%, filter(), and mutate()
chile_clean <- chile_data %>%
  filter(vote %in% c("Y", "N")) %>%
  mutate(
    vote_binary = ifelse(vote == "Y", 1, 0),
    region = as.factor(region),
    sex = as.factor(sex),
    education = as.factor(education)
  )
gam_chile <- gam(vote_binary ~ 
    s(age, bs = "cr") + s(income, bs = "cr", k=5) + s(statusquo, bs = "cr") +  sex + education + region,  family = binomial, data = chile_clean)

summary(gam_chile)

Family: binomial 
Link function: logit 

Formula:
vote_binary ~ s(age, bs = "cr") + s(income, bs = "cr", k = 5) + 
    s(statusquo, bs = "cr") + sex + education + region

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.03266    0.30450   3.391 0.000696 ***
sexM        -0.57778    0.20502  -2.818 0.004830 ** 
educationPS -0.96748    0.35672  -2.712 0.006685 ** 
educationS  -0.65136    0.25329  -2.572 0.010123 *  
regionM      0.68776    0.59989   1.146 0.251600    
regionN     -0.08501    0.36061  -0.236 0.813638    
regionS     -0.31113    0.29488  -1.055 0.291380    
regionSA    -0.12686    0.28111  -0.451 0.651777    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
               edf Ref.df  Chi.sq p-value    
s(age)       4.098  5.039   2.649   0.747    
s(income)    2.653  3.022   2.872   0.398    
s(statusquo) 1.581  1.958 520.739  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.766   Deviance explained = 70.6%
UBRE = -0.57269  Scale est. = 1         n = 1703
par(mfrow = c(2, 2))
plot(gam_chile, pages = 1)

From the model summary, the model explains 70.6% of the deviance (adjusted R² = 0.766), indicating a good fit. Among the smooth terms, only statusquo is strongly significant (p < 2e-16); its effective degrees of freedom (edf = 1.581) indicate a nearly linear effect. For the parametric terms, sex and the education levels are significant predictors, while region is not.

From the plots, statusquo shows a strong, nearly linear positive relationship with voting behavior. The relationship appears to be quite linear across the entire range. It has the strongest effect among all predictors.

Age shows a very weak, nearly flat relationship (p = 0.747), and the confidence bands are relatively wide; no significant nonlinear pattern is visible.

Income shows a slight nonlinear pattern but is not statistically significant (p = 0.398). The relationship is relatively flat with minor fluctuations, and the wide confidence bands indicate high uncertainty.

  2. If the results appear to be substantially nonlinear, can you deal with the nonlinearity in a suitably respecified generalized linear model (e.g., by transforming one or more explanatory variables)? If they do not appear nonlinear, still try a transformation to see if anything changes.
Solution
chile_clean$income_scaled <- scale(chile_clean$income)
gam_scaled <- gam(
  vote_binary ~ 
    s(age, bs = "cr") +
    s(income_scaled, bs = "cr", k = 5) +
    s(statusquo, bs = "cr") +
    sex + education + region,
  family = binomial,
  data = chile_clean
)
summary(gam_scaled)

Family: binomial 
Link function: logit 

Formula:
vote_binary ~ s(age, bs = "cr") + s(income_scaled, bs = "cr", 
    k = 5) + s(statusquo, bs = "cr") + sex + education + region

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.03266    0.30450   3.391 0.000696 ***
sexM        -0.57778    0.20502  -2.818 0.004830 ** 
educationPS -0.96748    0.35672  -2.712 0.006685 ** 
educationS  -0.65136    0.25329  -2.572 0.010123 *  
regionM      0.68776    0.59989   1.146 0.251600    
regionN     -0.08501    0.36061  -0.236 0.813638    
regionS     -0.31113    0.29488  -1.055 0.291380    
regionSA    -0.12686    0.28111  -0.451 0.651777    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                   edf Ref.df  Chi.sq p-value    
s(age)           4.098  5.039   2.649   0.747    
s(income_scaled) 2.653  3.022   2.872   0.398    
s(statusquo)     1.581  1.958 520.739  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.766   Deviance explained = 70.6%
UBRE = -0.57269  Scale est. = 1         n = 1703
par(mfrow = c(2, 2))
plot(gam_scaled, pages = 1)
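Scaling income is a linear transformation, so the smooth fit (and hence the summary above) is unchanged. To act on part (a)'s finding that the statusquo effect is essentially linear, one respecification is an ordinary logistic GLM; a sketch assuming `chile_clean` from above, where entering log(income) is an illustrative transformation rather than part of the original analysis:

```r
# plain logistic regression: statusquo entered linearly, income logged
# (log(income) is an illustrative choice; incomes in this dataset are positive)
glm_chile <- glm(vote_binary ~ age + log(income) + statusquo + sex + education + region,
                 family = binomial, data = chile_clean)
summary(glm_chile)
```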

4. Exercise E18.7

For this analysis, use the Duncan.txt data. Here we are interested in the outcome prestige and the explanatory variable income.

  1. Fit the local-linear regression of prestige on income with span \(s = 0.6\) (see Figure 18.7 in the book). This has 5.006 equivalent degrees of freedom, very close to the number of degrees of freedom for a fourth order polynomial.
Solution
d_data <- read.table("https://www.john-fox.ca/AppliedRegression/datasets/Duncan.txt")
library(KernSmooth) 
library(ggplot2)
# note: degree = 1 would give a strictly local-linear fit; degree = 2 (local-quadratic) is used here
model_loess <- loess(prestige ~ income, data=d_data, span=0.6, degree=2)
summary(model_loess)
Call:
loess(formula = prestige ~ income, data = d_data, span = 0.6, 
    degree = 2)

Number of Observations: 45 
Equivalent Number of Parameters: 5.1 
Residual Standard Error: 17.65 
Trace of smoother matrix: 5.6  (exact)

Control settings:
  span     :  0.6 
  degree   :  2 
  family   :  gaussian
  surface  :  interpolate     cell = 0.2
  normalize:  TRUE
 parametric:  FALSE
drop.square:  FALSE 
cat("Equivalent degrees of freedom:", model_loess$trace.hat)
Equivalent degrees of freedom: 5.600156
plot(d_data$income, d_data$prestige,
     xlab = "income",
     ylab = "prestige",
     main = "Local-Linear regression of prestige on income")

# Add the fitted line
income_grid <- seq(min(d_data$income), max(d_data$income), length.out = 100)
predicted <- predict(model_loess, newdata = data.frame(income = income_grid))
lines(income_grid, predicted, col = "blue", lwd = 2)

  2. Fit a fourth order polynomial of the data and compare the resulting regression curve with the local-linear regression.
Solution
poly_model <- lm(prestige ~ poly(income, 4), data=d_data)

income_grid <- seq(min(d_data$income), 
                  max(d_data$income), 
                  length.out = 100)

# Get predictions for both models
local_pred <- predict(model_loess, 
                     newdata = data.frame(income = income_grid))
poly_pred <- predict(poly_model, 
                    newdata = data.frame(income = income_grid))

plot(d_data$income, d_data$prestige,
     xlab = "Income",
     ylab = "Prestige",
     main = "comparison of Local-Linear and polynomial regression",
     pch = 16,col = "gray40")      

# Add both fitted lines
lines(income_grid, local_pred, 
      col = "blue", 
      lwd = 2,
      lty = 1)           
lines(income_grid, poly_pred, 
      col = "red", 
      lwd = 2,
      lty = 2)      

legend("topleft",
       legend = c("Local-Linear (span=0.6)", 
                 "4th Order Polynomial"),
       col = c("blue", "red"),
       lwd = 2,
       lty = c(1, 2),
       bg = "white")

cat("\nSummary of fourth-order polynomial fit:\n")

Summary of fourth-order polynomial fit:
print(summary(poly_model))

Call:
lm(formula = prestige ~ poly(income, 4), data = d_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.590  -9.181   2.714  10.295  60.323 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        47.689      2.615  18.233  < 2e-16 ***
poly(income, 4)1  175.114     17.545   9.981 2.04e-12 ***
poly(income, 4)2  -14.998     17.545  -0.855    0.398    
poly(income, 4)3  -10.084     17.545  -0.575    0.569    
poly(income, 4)4  -19.562     17.545  -1.115    0.272    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.55 on 40 degrees of freedom
Multiple R-squared:  0.7181,    Adjusted R-squared:   0.69 
F-statistic: 25.48 on 4 and 40 DF,  p-value: 1.538e-10
# R-squared for both fits for comparison
# For local-linear fit
local_fitted <- predict(model_loess, newdata = data.frame(income = d_data$income))
local_r2 <- 1 - sum((d_data$prestige - local_fitted)^2) / 
            sum((d_data$prestige - mean(d_data$prestige))^2)

cat("\nR-squared values:\n")

R-squared values:
cat("Local-linear R²:", round(local_r2, 4), "\n")
Local-linear R²: 0.7227 
cat("Polynomial R²:", round(summary(poly_model)$r.squared, 4), "\n")
Polynomial R²: 0.7181 

The plot shows that both models capture similar overall patterns in the data, with prestige generally increasing with income but showing some nonlinear behavior. Both start with a steep increase at lower income levels, level off at higher incomes, and show a slight decrease at the very highest income levels.

However, the local-linear fit appears slightly smoother, while the polynomial fit shows slightly more curvature, especially in the middle range. The difference in R² is very small (about 0.0046), suggesting comparable overall fit quality; the local-linear model performs slightly better in terms of R².

The polynomial regression statistics show a highly significant first-order term (p = 2.04e-12). The higher-order terms (2nd through 4th) are not individually significant (p-values > 0.05).